Project 3: Aim 1

Prediction of Gene Function using Phylogenetic Trees
IMAGE Retreat 2023

George G. Vega Yon
Paul Thomas
Paul Marjoram
Huaiyu Mi
Christopher Williams

2023-06-05

You can download the slides from https://ggv.cl/image-retreat2023

Recap

Gene Function: the Gene Ontology Project

  • The GO project has \(\sim\) 43,000 validated terms, \(\sim\) 7.4M annotations on \(\sim\) 5,200 species.

  • About 700,000 annotations are on human genes.

  • Only half of the human gene annotations are based on experimental evidence.

  • About 173,000 publications have used the GO.

  • GO annotations provide valuable information for many downstream applications, e.g., contextualizing -omics analyses.

source: Statistics from http://pantherdb.org/panther/summaryStats.jsp and http://geneontology.org/stats.html

Gene Functions

Molecular Function
Active transport GO:0005215

Cellular Component
Mitochondrion GO:0005739

Biological Process
Heart contraction GO:0060047

Predicting Function: State-of-the-art

Sequences, phylogenomics, and ML.

  • aphylo (by yours truly): Another phylo-based method. Leverages negative annotations and pooled trees (G. G. Vega Yon et al. 2021).
  • DeepGOPlus: One of the top-performing models in the literature, uses a 2D convolutional neural network on gene sequences (Kulmanov and Hoehndorf 2019).
  • GOLabeler: The top-performing tool according to the Critical Assessment of Function Annotation (CAFA) challenge. It is an ensemble of various simple ML methods, including K-means and logistic regression (You et al. 2018).
  • DeepFRI: Uses Graph Convolutional Neural Networks (GCNs) to predict function based on protein structure and genetic sequence (Gligorijević et al. 2021).

Starting point

“Bayesian Parameter Estimation for Automatic Annotation of Gene Functions Using Observational Data and Phylogenetic Trees” – (G. G. Vega Yon et al. 2021)

Using phylogenetic trees to predict gene function.

  • Simple model using Felsenstein’s pruning algorithm.
  • Assumes that genes’ functions evolve independently.
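As a rough illustration of the pruning idea, here is a minimal Python sketch for a single binary function on a small tree. It is not the aphylo implementation: the function names, the dictionary-based tree, and the parameters `gain`, `loss`, and `prior_has` are all assumptions made for this example, and the published model has additional parameters that are left out here.

```python
import numpy as np

def transition_matrix(gain, loss):
    # Rows: parent state (0 = lacks the function, 1 = has it).
    # Columns: offspring state. Hypothetical fixed gain/loss probabilities.
    return np.array([[1.0 - gain, gain],
                     [loss, 1.0 - loss]])

def pruning_likelihood(children, leaf_state, root, gain, loss, prior_has):
    """Likelihood of the leaf annotations via Felsenstein's pruning.

    children:   dict mapping each internal node to a list of its offspring.
    leaf_state: dict mapping each leaf to 0, 1, or None (no annotation).
    """
    P = transition_matrix(gain, loss)

    def partial(node):
        # Returns [Pr(data below node | node = 0), Pr(data below node | node = 1)].
        kids = children.get(node, [])
        if not kids:
            state = leaf_state.get(node)
            if state is None:
                return np.ones(2)  # unannotated leaf: no information
            out = np.zeros(2)
            out[state] = 1.0
            return out
        out = np.ones(2)
        for kid in kids:
            # Sum over the offspring state, weighting by the transition probability.
            out *= P @ partial(kid)
        return out

    root_prior = np.array([1.0 - prior_has, prior_has])
    return float(root_prior @ partial(root))

# Toy tree: one root with two annotated leaves.
toy_children = {"root": ["a", "b"]}
toy_leaves = {"a": 1, "b": 0}
print(pruning_likelihood(toy_children, toy_leaves, "root", 0.1, 0.2, 0.5))
```

The recursion visits each node once, so the cost is linear in the number of nodes when each function is treated independently.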

Models in the project

The key difference between the models is how they model the transition from parent to offspring: \(\mathbb{P}\left(x_n\to x_o\right)\)

aphylo

  • Fixed gain/loss rates.

  • Full independence between genes/functions.

aphylo2

  • Event-specific gain/loss rates.

  • Full independence between genes/functions.

GEESE

  • Event-specific gain/loss rates.

  • Jointly distributed model.
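
One way to write out the contrast, as a sketch only. The notation is assumed for illustration and is not taken from the slides: \(k\) indexes functions, \(\mu_{01}\) and \(\mu_{10}\) are gain and loss probabilities, \(e(n)\) is the event type at the parent node (e.g., duplication or speciation), \(s(\cdot)\) is a vector of sufficient statistics, and \(\theta\) its parameters.

\[
\text{aphylo:}\quad \mathbb{P}\left(x_n\to x_o\right) = \prod_{k} \mathbb{P}\left(x_{n,k}\to x_{o,k}\right), \qquad \mathbb{P}\left(0\to 1\right) = \mu_{01},\quad \mathbb{P}\left(1\to 0\right) = \mu_{10}
\]

\[
\text{aphylo2:}\quad \text{same product form, with } \mu_{01} = \mu_{01}^{e(n)},\quad \mu_{10} = \mu_{10}^{e(n)}
\]

\[
\text{GEESE:}\quad \mathbb{P}\left(x_n\to x_{o_1},\dots,x_{o_m}\right) = \frac{\exp\left\{\theta^{\top} s\left(x_n, x_{o_1},\dots,x_{o_m}\right)\right\}}{\sum_{x'_{o_1},\dots,x'_{o_m}} \exp\left\{\theta^{\top} s\left(x_n, x'_{o_1},\dots,x'_{o_m}\right)\right\}}
\]

In the first two forms the probability factorizes over functions and offspring; in the third, all offspring of a node and all functions are modeled jointly.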

Evolution of Gene Function (Multiple Functions): Tapping into Evolutionary Theory

  • Duplication events are a fundamental part of functional evolution.
  • Furthermore, knowing what happened to gene A (e.g., neofunctionalization) is highly informative about the functional state of gene B.

A key part of molecular innovation, gene duplication provides an opportunity for new functions to emerge (wikimedia)

Sufficient statistics

Characterizing the evolution of gene function using sufficient statistics.

GEESE: GEne functional Evolution using SufficiEncy
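
To make the idea of sufficient statistics concrete, the following Python sketch defines a toy discrete exponential-family transition model for one parent and its offspring. It is an illustration under stated assumptions, not the GEESE implementation: the three statistics (gains, losses, and functions gained by exactly one sibling) and all function names are hypothetical choices for this example.

```python
import itertools
import math

def suff_stats(parent, offspring):
    """Toy sufficient statistics for a parent-to-offspring transition.

    parent:    tuple of k binary states (one per function).
    offspring: tuple of m tuples, each with k binary states.
    """
    gains = sum(1 for o in offspring for p, x in zip(parent, o) if p == 0 and x == 1)
    losses = sum(1 for o in offspring for p, x in zip(parent, o) if p == 1 and x == 0)
    # Functions the parent lacked that exactly one sibling gained
    # (a statistic that couples the offspring to each other).
    single_gain = sum(
        1
        for j, p in enumerate(parent)
        if p == 0 and sum(o[j] for o in offspring) == 1
    )
    return (gains, losses, single_gain)

def transition_prob(parent, offspring, theta):
    """Pr(offspring | parent) proportional to exp(theta . s(parent, offspring))."""
    k, m = len(parent), len(offspring)

    def weight(off):
        return math.exp(sum(t * s for t, s in zip(theta, suff_stats(parent, off))))

    gene_states = list(itertools.product([0, 1], repeat=k))
    # Normalizing constant: enumerate every joint configuration of the m offspring.
    z = sum(weight(off) for off in itertools.product(gene_states, repeat=m))
    return weight(offspring) / z

# Two functions, two offspring: one gain and one loss.
print(transition_prob((0, 1), ((1, 1), (0, 0)), theta=(-1.0, -2.0, 0.5)))
```

When the third parameter is zero, this toy model reduces to independent gains and losses per gene and function; a nonzero value makes the siblings' states depend on each other. The normalizing constant sums over \(2^{k \cdot m}\) configurations for \(k\) functions and \(m\) offspring, which is why enumeration only works for small \(k\) and \(m\).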

The model is fully implemented in C++ and R. It already shows great promise.

Challenges

Rough edges

We have only discussed Sub Aim 2

  • Sub Aim 1 is supposed to deal with the Hierarchical Bayesian Framework.

  • Some work is done, but we need a leader for this.

Data challenges

  • Most trees aren’t fully reconciled: GEESE’s complexity grows exponentially with the number of offspring in the tree.

  • Negative assertions (gene NOT associated with) are rare… but Christopher has made good progress using taxon constraints.

  • Comparing with other methods: Very difficult because of the lack of reproducibility.

Discussion

Goals

Manuscripts

  • The GEESE paper is about 80% done. We need to finish it this year.
  • Low-hanging fruit: aphylo2 can be submitted to PLOS Comp. Bio. using aphylo (a software prototype is up and running).

Collaboration (ideas)

  • Augment -omics data with GEESE (à la carte): Use GEESE on a gene list to make predictions, then use those predictions as additional -omics data (i.e., beyond ad-hoc post-analysis).
  • Or use predictions from other projects as features in GEESE.

Bonus: Mechanistic ML (prelim res.)

Not mentioned in the original grant, but we could add it to this project (or any other).

References

Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. 1990. “Basic Local Alignment Search Tool.” Journal of Molecular Biology 215 (3): 403–10. https://doi.org/10.1016/S0022-2836(05)80360-2.
Engelhardt, B. E., M. I. Jordan, K. E. Muratore, and S. E. Brenner. 2005. “Protein Molecular Function Prediction by Bayesian Phylogenomics.” PLOS Computational Biology 1 (5). https://doi.org/10.1371/journal.pcbi.0010045.
Engelhardt, B. E., M. I. Jordan, J. R. Srouji, and S. E. Brenner. 2011. “Genome-Scale Phylogenetic Function Annotation of Large and Diverse Protein Families.” Genome Research 21 (11): 1969–80. https://doi.org/10.1101/gr.104687.109.
Gligorijević, Vladimir, P. Douglas Renfrew, Tomasz Kosciolek, Julia Koehler Leman, Daniel Berenberg, Tommi Vatanen, Chris Chandler, et al. 2021. “Structure-Based Protein Function Prediction Using Graph Convolutional Networks.” Nature Communications 12 (1): 3168. https://doi.org/10.1038/s41467-021-23303-9.
Jain, Aashish, and Daisuke Kihara. 2019. “Phylo-PFP: Improved Automated Protein Function Prediction Using Phylogenetic Distance of Distantly Related Sequences.” Bioinformatics 35 (March): 753–59. https://doi.org/10.1093/bioinformatics/bty704.
Kulmanov, Maxat, and Robert Hoehndorf. 2019. “DeepGOPlus: Improved Protein Function Prediction from Sequence.” Edited by Lenore Cowen. Bioinformatics, July, btz595. https://doi.org/10.1093/bioinformatics/btz595.
Vega Yon, G. G., D. C. Thomas, J. Morrison, H. Mi, P. D. Thomas, and P. Marjoram. 2021. “Bayesian Parameter Estimation for Automatic Annotation of Gene Functions Using Observational Data and Phylogenetic Trees.” PLOS Computational Biology 17: e1007948. https://doi.org/10.1371/journal.pcbi.1007948.
Vega Yon, George G., Mary Jo Pugh, and Thomas W. Valente. 2022. “Discrete Exponential-Family Models for Multivariate Binary Outcomes.” arXiv. https://arxiv.org/abs/2211.00627.
You, Ronghui, Zihan Zhang, Yi Xiong, Fengzhu Sun, Hiroshi Mamitsuka, and Shanfeng Zhu. 2018. “GOLabeler: Improving Sequence-Based Large-Scale Protein Function Prediction by Learning to Rank.” Edited by Jonathan Wren. Bioinformatics 34 (14): 2465–73. https://doi.org/10.1093/bioinformatics/bty130.